Abstract

The Lasso (Tibshirani, 1996) is an attractive technique for regularization and variable
selection for high-dimensional data, where the number of predictor variables $p$ is
potentially much larger than the number of samples $n$. However, it was recently
discovered (Zhao and Yu, 2006; Zou, 2005; Meinshausen and Bühlmann, 2006) that the
sparsity pattern of the Lasso estimator can only be asymptotically identical to the true
sparsity pattern if the design matrix satisfies the so-called irrepresentable condition.

The  latter  condition  can  easily  be  violated  in  applications  due  to  the  presence  of  highly 
correlated  variables. 

Here we examine the behavior of the Lasso estimators if the irrepresentable condition
is relaxed. Even though the Lasso cannot recover the correct sparsity pattern, we
show that the estimator is still consistent in the $\ell_2$-norm sense for fixed designs under
conditions on (a) the number $s_n$ of non-zero components of the vector $\beta_n$ and (b) the
minimal singular values of the design matrices that are induced by selecting of order $s_n$
variables. The results are extended to vectors $\beta$ in weak $\ell_q$-balls with $0 < q < 1$. Our
results imply that, with high probability, all important variables are selected. The set
of selected variables is a useful (meaningful) reduction of the original set of variables
($p_n > n$). Finally, our results are illustrated with the detection of closely adjacent
frequencies, a problem encountered in astrophysics.
1 Introduction


The Lasso was introduced by Tibshirani (1996) and has since proved to be very popular
and well studied (Knight and Fu, 2000; Zhao and Yu, 2006; Zou, 2005; Wainwright, 2006).
Some reasons for the popularity might be that the entire regularization path of the Lasso
can be computed efficiently (Osborne et al., 2000; Efron et al., 2004), that the Lasso is able
to handle more predictor variables than samples, and that it produces sparse models which
are easy to interpret. Several extensions and variations have been proposed (Yuan and Lin,
2005; Zhao and Yu, 2004; Zou, 2005; Meier et al., 2006; Candes and Tao, 2005b).

1.1 Lasso-type estimation

The Lasso estimator, as introduced by Tibshirani (1996), is given by

$$\hat\beta^\lambda = \arg\min_\beta \|Y - X\beta\|_{\ell_2}^2 + \lambda \|\beta\|_{\ell_1}, \qquad (1)$$

where $X = (X_1, \ldots, X_p)$ is the $n \times p$ matrix whose columns consist of the $n$-dimensional
fixed predictor variables $X_k$, $k = 1, \ldots, p$. The vector $Y$ contains the $n$ real-valued
observations of the response variable.
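
As an illustration of (1), the following is a minimal sketch of a cyclic coordinate descent solver. It is not the path algorithm of Osborne et al. (2000) or Efron et al. (2004); coordinate descent is swapped in here only because it is short. The names `soft_threshold` and `lasso_cd`, the stopping rule, and the number of sweeps are assumptions made for this example. Each one-dimensional update minimizes (1) in a single coefficient, which gives a soft-threshold at level $\lambda/2$ because the squared error in (1) carries no factor 1/2.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z towards zero by t and set it to zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, Y, lam, n_sweeps=500, tol=1e-10):
    """Minimize ||Y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent.

    Hypothetical helper for illustration; the objective matches (1),
    without any 1/2 or 1/n rescaling.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)            # squared column norms
    resid = np.asarray(Y, dtype=float).copy()  # residual Y - X beta for beta = 0
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in range(p):
            if col_sq[k] == 0.0:
                continue
            # inner product of column k with the partial residual that
            # excludes the current contribution of coefficient k
            c = X[:, k] @ resid + col_sq[k] * beta[k]
            new = soft_threshold(c, lam / 2.0) / col_sq[k]
            delta = new - beta[k]
            if delta != 0.0:
                resid -= X[:, k] * delta
                beta[k] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta
```

For a design with orthogonal columns the loop converges after a single sweep, since each coordinate update is then independent of the others.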

The distribution of Lasso-type estimators has been studied in Knight and Fu (2000). Variable
selection and prediction properties of the Lasso have been studied extensively for
high-dimensional data with $p \gg n$, a frequently encountered challenge in modern statistical
applications. Some studies (e.g. Greenshtein and Ritov, 2004; van de Geer, 2006) have focused
mainly on the behavior of prediction loss. Much recent work aims at understanding
the Lasso estimates from the point of view of model selection, including Meinshausen and
Bühlmann (2006), Donoho et al. (2006), Zhao and Yu (2006), Candes and Tao (2005b) and
Zou (2005). For the Lasso estimates to be close to the model selection estimates when
the data dimensions grow, all the aforementioned papers assumed a sparse model and used
various conditions stating that the irrelevant variables are not too correlated with the
relevant ones. Incoherence is the terminology used in the deterministic setting of Donoho
et al. (2006), and “irrepresentability” is used in the stochastic setting (linear model) of Zhao
and Yu (2006). Here we focus exclusively on the properties of the estimate of the coefficient
vector under squared error loss and try to understand the behavior of the estimate under
a relaxed irrepresentable condition (hence we are in the stochastic or linear model setting).
The aim is to see whether the Lasso still gives meaningful models in this case.

The connections with other work are discussed further in Section 1.5, after the notions
needed to state the irrepresentable condition explicitly have been introduced.
1.2 Linear Model

We assume a linear model for the observations of the response variable $Y = (Y_1, \ldots, Y_n)$,

$$Y = X\beta + \varepsilon, \qquad (2)$$

where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ is a vector containing independently and identically distributed noise
with $E(\varepsilon_i) = 0$. When there is a question of nonidentifiability for $\beta$ when $p > n$, we define
$\beta$ as

$$\beta = \arg\min_{\{\beta :\, EY = X\beta\}} \|\beta\|_{\ell_1}. \qquad (3)$$

The aim is to recover the vector $\beta$ as well as possible from noisy observations $Y$. For the
equivalence between $\ell_1$- and $\ell_0$-sparse solutions see for example Gribonval and Nielsen (2003);
Donoho and Elad (2003); Donoho (2006).

1.3  Recovery  of  the  sparsity  pattern  and  the  irrepresentable  condition 

There is empirical evidence that many signals in high-dimensional spaces allow for a sparse
representation. As an example, wavelet coefficients of images often exhibit exponential
decay, and a relatively small subset of all wavelet coefficients allows a good approximation to
the original image (Joshi et al., 1995; LoPresto et al., 1997; Mallat, 1989). For conceptual
simplicity, we first assume in our regression setting that the vector $\beta$ is sparse in the $\ell_0$-sense
and many coefficients of $\beta$ are identically zero (this will later be relaxed). The corresponding
variables thus have no influence on the response variable and could be safely removed. The
sparsity pattern of $\beta$ is understood to be the sign function of its entries, with $\mathrm{sign}(x) = 0$
if $x = 0$, $\mathrm{sign}(x) = 1$ if $x > 0$ and $\mathrm{sign}(x) = -1$ if $x < 0$. The sparsity pattern of a vector
might thus look like

$$\mathrm{sign}(\beta) = (+1, -1, 0, 0, +1, +1, -1, +1, 0, 0, \ldots),$$

distinguishing whether variables have a positive, negative or no influence at all on the
response variable. It is of interest whether the sparsity pattern of the Lasso estimator is a good
approximation to the true sparsity pattern. If these sparsity patterns agree asymptotically,
the estimator is said to be sign consistent (Zhao and Yu, 2006).

Definition 1 (Sign consistency) An estimator $\hat\beta^\lambda$ is sign consistent if and only if

$$P\{\mathrm{sign}(\beta) = \mathrm{sign}(\hat\beta^\lambda)\} \to 1 \quad n \to \infty.$$

It was shown independently in Zhao and Yu (2006) and Zou (2005) in the linear model case and
in Meinshausen and Bühlmann (2006) in the Gaussian graphical model setting that sign consistency
requires a condition on the design matrix. The assumption was termed the irrepresentable
condition in Zhao and Yu (2006). Let $C = n^{-1} X^T X$. The dependence on $n$ is neglected
notationally.
Definition 2 (Irrepresentable condition) Let $K = \{k : \beta_k \neq 0\}$ be the set of relevant
variables and let $N = \{1, \ldots, p\} \setminus K$ be the set of noise variables. The sub-matrix $C_{HK}$
is understood as the matrix obtained from $C$ by keeping rows with index in the set $H$ and
columns with index in $K$. The irrepresentable condition is fulfilled if it holds element-wise
that

$$|C_{NK} C_{KK}^{-1} \mathrm{sign}(\beta_K)| < 1.$$

In Zhao and Yu (2006), an additional strong irrepresentable condition is defined which
requires that the above elements are not merely smaller than 1 but are uniformly bounded
away from 1. Zhao and Yu (2006), Zou (2005) and Meinshausen and Bühlmann (2006) show
that the Lasso is sign consistent only if the irrepresentable condition holds.
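
For a known coefficient vector, the condition of Definition 2 can be evaluated directly. The sketch below assumes that the relevant columns are linearly independent, so that $C_{KK}$ is invertible; the function names `irrepresentable_values` and `irrepresentable_holds` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def irrepresentable_values(X, beta):
    """Return |C_NK C_KK^{-1} sign(beta_K)|, one entry per noise variable.

    Hypothetical helper; assumes C_KK is invertible.
    """
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    n = X.shape[0]
    C = X.T @ X / n                       # C = n^{-1} X^T X
    K = np.flatnonzero(beta != 0)         # relevant variables
    N = np.flatnonzero(beta == 0)         # noise variables
    C_KK = C[np.ix_(K, K)]
    C_NK = C[np.ix_(N, K)]
    return np.abs(C_NK @ np.linalg.solve(C_KK, np.sign(beta[K])))


def irrepresentable_holds(X, beta):
    """True if every element is strictly smaller than 1."""
    return bool(np.all(irrepresentable_values(X, beta) < 1.0))
```

A noise variable with correlation 0.6 to each of two orthogonal relevant variables gives the value 1.2 when both coefficients have the same sign, so the condition fails, whereas opposite signs give the value 0. The condition therefore depends on the sign pattern of $\beta$ and not only on the design.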

Proposition 1 (Sign consistency only under irrepresentable condition) Let Assumptions
1-4 in Section 2.2 be satisfied. Assume that the irrepresentable condition is not fulfilled.
Then there exists no sequence $\lambda = \lambda_n$ such that the estimator $\hat\beta^{\lambda_n}$ is sign consistent.

In  practice,  it  might  be  difficult  to  verify  whether  the  condition  is  fulfilled.  This  led  various 
authors  to  propose  interesting  extensions  to  the  Lasso  (Zhang  and  Lu,  2006;  Zou,  2005; 
Meinshausen,  2006).  Before  giving  up  on  the  Lasso  altogether,  however,  we  want  to  examine 
in  this  paper  in  what  sense  the  original  Lasso  procedure  still  gives  sensible  results,  even  if 
the  irrepresentable  condition  is  not  fulfilled. 

1.4 $\ell_2$-consistency

The aforementioned studies showed that if the irrepresentable condition is not fulfilled, the
Lasso cannot select the correct sparsity pattern. In this paper we show that, in these cases, the
Lasso selects the non-zero entries of $\beta$ and some not-too-many additional zero entries of $\beta$
under conditions weaker than the irrepresentable condition. The non-zero entries of $\beta$ are
in any case included in the selected model. Moreover, the size of the estimated coefficients
allows one to separate the few truly zero and the many non-zero coefficients. However, it is worth
noting that in the extreme cases when the variables are linearly dependent, these relaxed
conditions will be violated as well. In these situations, it is not sensible to use the $\ell_2$-metric
on $\beta$ to assess the Lasso. Other metrics are to be investigated in our future research.

Our main result shows the $\ell_2$-consistency of the Lasso, even if the irrepresentable condition
is relaxed. To be precise, an estimator is said to be $\ell_2$-consistent if

$$\|\beta - \hat\beta\|_{\ell_2} \to 0 \quad n \to \infty. \qquad (4)$$

Convergence rates will also be derived. An $\ell_2$-consistent estimator is attractive, as important
variables are chosen with high probability and falsely chosen variables have very small
coefficients. The bottom line will be that even if the sparsity pattern of $\beta$ cannot be recovered
by the Lasso, we can still obtain a good approximation.
1.5 Related work.


Here we discuss further the existing work on the Lasso mentioned earlier. Prediction loss for
high-dimensional regression with Lipschitz loss functions under an $\ell_1$-penalty is examined
in van de Geer (2006). Bounds for the $\ell_1$-distance between the vector $\beta$ and its Lasso
estimate are also derived there. Results similar to those in van de Geer (2006) are obtained for
random designs and squared error loss in Bunea et al. (2006b), both in terms of prediction
loss and $\ell_1$-metric. A difference with the model selection results in these papers is that we are
able to obtain $\ell_2$-consistency for fixed designs even if the sparsity $s_n$ (the number of non-zero
coefficients) is growing almost as fast as $n$, while the previous results need $s_n = o(\sqrt{n})$, partly
because they require a data-independent choice of the penalty parameter. The previous study
of Bunea et al. (2006a) considers fixed designs (as we do) and obtains very nice results,
albeit limited to the setting $p < n$, while we are interested in the high-dimensional case
where the number $p$ of predictor variables is possibly very much larger than the sample size
$n$.

Moreover, we would like to compare the results of this manuscript briefly with results in
Donoho (2004) and Candes and Tao (2005b). These papers derive bounds on the $\ell_2$-norm
distance between $\beta$ and $\hat\beta$ for $\ell_1$-norm constrained estimators. In Donoho (2004) the
design is random and the random predictor variables are assumed to be independent. The
results are thus not directly comparable to the results derived here for general fixed designs.
Nevertheless, results in Meinshausen and Bühlmann (2006) suggest that the irrepresentable
condition is fulfilled with high probability for independent normally distributed predictor
variables. The results in Donoho (2004) thus cannot be used directly to study the
behavior of the Lasso under a violated irrepresentable condition, which is our goal in the current
manuscript.

Candes and Tao (2005b) study the properties of the so-called “Dantzig selector”, which is
very similar to the Lasso, and derive remarkably sharp bounds on the $\ell_2$-distance between
the vector $\beta$ and the proposed estimator $\hat\beta$. The results are derived under the condition of
a Uniform Uncertainty Principle (UUP), which was introduced in Candes and Tao (2005a).
The UUP is a relaxation of the irrepresentable condition and is very similar to our
assumptions on sparse eigenvalues in this manuscript. It would be of interest to study the connection
between the Lasso and the “Dantzig selector” further, as the solutions share many similarities.

2  Main  assumptions  and  results 

First,  we  introduce  the  notion  of  sparse  eigenvalues,  which  will  play  a  crucial  role  in  providing 
bounds  for  the  convergence  rates  of  the  Lasso  estimator.  Thereafter,  the  assumptions  are 
explained  in  detail  and  the  main  results  are  given. 
2.1 Sparse eigenvalues

The notion of sparse eigenvalues is not new and has been used before (Donoho, 2006); we
merely intend to fix notation. The $m$-sparse minimal eigenvalue of a matrix is the minimal
eigenvalue over all $m \times m$-dimensional principal submatrices.


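Definition 3 The $m$-sparse minimal eigenvalue and $m$-sparse maximal eigenvalue of $C$ are
defined as

$$\phi_{\min}(m) = \min_{\beta :\, \|\beta\|_{\ell_0} \le m} \frac{\beta^T C \beta}{\beta^T \beta} \quad \text{and} \quad \phi_{\max}(m) = \max_{\beta :\, \|\beta\|_{\ell_0} \le m} \frac{\beta^T C \beta}{\beta^T \beta}. \qquad (5)$$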


The minimal eigenvalue of the unrestricted matrix $C$ is equivalent to $\phi_{\min}(p)$. If the number of
predictor variables $p$ is larger than the sample size, $p > n$, this eigenvalue is zero, as $\phi_{\min}(m) = 0$
for any $m > n$.

A crucial factor contributing to the convergence of the Lasso estimator is the behavior of
the smallest $m$-sparse eigenvalue, where the number $m$ of variables over which the minimal
eigenvalue is computed is roughly identical to the sparsity $s$ of the true underlying vector.
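
For small problems the quantities in (5) can be computed by brute force. The sketch below enumerates all principal submatrices of size $m$; this suffices because, by eigenvalue interlacing, the extreme eigenvalues over supports of size at most $m$ are attained on supports of size exactly $m$. The function name `sparse_eigenvalues` is hypothetical, and the cost grows combinatorially in $p$, so this is meant only as an illustration of the definition.

```python
import itertools

import numpy as np


def sparse_eigenvalues(C, m):
    """Brute-force m-sparse minimal and maximal eigenvalues of a symmetric C.

    Hypothetical helper for small p only: enumerates all subsets of size m.
    """
    C = np.asarray(C, dtype=float)
    p = C.shape[0]
    m = min(m, p)
    phi_min, phi_max = np.inf, -np.inf
    for subset in itertools.combinations(range(p), m):
        idx = np.array(subset)
        # eigenvalues of the principal submatrix, in ascending order
        eig = np.linalg.eigvalsh(C[np.ix_(idx, idx)])
        phi_min = min(phi_min, eig[0])
        phi_max = max(phi_max, eig[-1])
    return phi_min, phi_max
```

Applied to $C = n^{-1} X^T X$ with $p > n$, the function returns a minimal eigenvalue that is zero up to rounding as soon as $m > n$, in line with the remark above.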


2.2 Assumptions for high-dimensional data

We make some assumptions to prove the main result for high-dimensional data. We understand
the term “high-dimensional” to imply here and in the following that we have potentially
many more predictor variables than samples, $p_n \gg n$. While we are mainly interested in the
$p_n > n$ case, the results are also relevant for $p_n < n$ scenarios. First, a convenient technical
assumption.

Assumption 1 The predictor variables are normalized, $\|X_k\|_{\ell_2}^2 = n$ for all $k$, $n$ and

$$\max_{k \le p_n} \|X_k\|_{\ell_\infty} < \infty.$$

As  predictor  variables  are  normalized  in  practice  anyway,  the  first  part  of  the  assumption 
is  not  very  restrictive.  The  second  part  is  mostly  a  technical  assumption  and  simplifies 
exposition. 

Assumption 2 The noise satisfies $E(\exp|\varepsilon_i|) < \infty$ and $E(\varepsilon_i^2) = \sigma^2$ for some $\sigma^2 > 0$.

The  assumption  of  exponential  tail  bounds  for  the  noise  is  fairly  standard  and  certainly 
covers  the  case  of  Gaussian  errors. 

Assumption 3 There exists some $\sigma_y^2 < \infty$ such that $E(Y_i^2) \le \sigma_y^2$ for all $i \in \mathbb{N}$.

This assumption is equivalent to a re-scaling of the coefficient vectors $\beta$ so that the
signal-to-noise ratio stays approximately constant for all values of $n$ in our triangular setup.
Assumption 4 For all $n$, the maximal eigenvalue $\phi_{\max}(\min\{n, p\})$ for any selection of
$\min\{n, p\}$ columns is bounded from above by some finite value. The minimal eigenvalue
for any selection of $\min\{n, p\}$ columns is strictly positive, $\phi_{\min}(\min\{n, p\}) > 0$.

Both  parts  of  the  assumption  could  be  relaxed  at  the  cost  of  increased  notational  complexity. 
It  might  be  interesting  to  check  if  the  assumptions  are  reasonable  for  a  random  design  matrix. 
The  latter  part  of  Assumption  4  is  fulfilled  with  probability  1  if  the  distribution  of  the  random 
predictor  variables  is  non-singular.  Consider  now  the  first  part  of  the  assumption  about  a 
bounded  maximal  eigenvalue.  To  be  specific,  assume  multivariate  normal  predictors.  If  the 
maximal  eigenvalue  of  the  population  covariance  matrix,  which  is  induced  by  selecting  n 
variables, is bounded from above by an arbitrarily large constant, it follows by Theorem
2.13 in Davidson and Szarek (2001) or Lemma A3.1 in Paul (2006) that the condition number
of the induced sample covariance matrix obeys a Gaussian tail bound. Using an entropy
bound for the possible number of subsets when choosing $n$ out of $p$ variables, Assumption 4
is thus fulfilled with probability converging to 1 for $n \to \infty$ as long as $\log p_n = o(n^\kappa)$ for
some $\kappa < 1$, and is thus perhaps not overly restrictive.

2.3  Incoherent  designs 

As is apparent from the interesting discussion in Candes and Tao (2005b), one cannot allow
arbitrarily large “coherence” between variables if one still hopes to recover the correct sparsity
pattern. Assume that there are two vectors $\beta$ and $\tilde\beta$ so that the signal can be represented by
either vector, $X\beta = X\tilde\beta$, and both vectors are equally sparse, say $\|\beta\|_{\ell_0} = \|\tilde\beta\|_{\ell_0} = s$, and are
not identical. We have no hope of distinguishing between $\beta$ and $\tilde\beta$ in such a case: if indeed
$X\beta = X\tilde\beta$ and $\beta$ and $\tilde\beta$ are not identical, it follows that the minimal sparse eigenvalue
$\phi_{\min}(2s) = 0$ vanishes, as $X(\beta - \tilde\beta) = 0$ and $\|\beta - \tilde\beta\|_{\ell_0} \le 2s$. The sparse minimal eigenvalue
of a selection of order $s$ variables thus indicates whether we have any hope of recovering the true
sparse underlying vector from noisy observations. A design is called $m_n$-incoherent in the
following if the minimal eigenvalue of a collection of $m_n$ variables is bounded from below by
a constant.

Definition 4 ($m_n$-incoherent designs) Let $m_n$ be a sequence with $m_n = o(n)$ for $n \to \infty$.
A design is called incoherent for $m_n$ if the minimal eigenvalue of a collection of $m_n$ variables
is bounded from below, that is if

$$\liminf_{n \to \infty} \phi_{\min}(m_n) > 0. \qquad (6)$$

Our main result will require an $s_n \log n$-incoherent design. Most of the previous results on
the Lasso and related $\ell_1$-constrained estimators have used similar, if slightly stronger, concepts.
Donoho and Huo (2001) defined the mutual coherence $M$ between two orthonormal bases
as the maximal absolute value of the inner product of two elements in the two orthonormal
bases. One could extend this definition to arbitrary dictionaries where basis elements are
scaled to have unit norm. Under the common assumption $M = O(1/n)$, the design is
certainly incoherent in the meaning above, as the eigenvalue of any selection of order $s_n$
variables, with $s_n = o(n)$, will be bounded from below by a constant for sufficiently large
values of $n$. The notion of incoherence above thus covers a wider spectrum.

Candes and Tao (2005b) use a Uniform Uncertainty Principle (UUP) to discuss the
convergence of the so-called Dantzig selector. The UUP can only be fulfilled if the minimal
eigenvalue of a selection of $s_n$ variables is bounded from below by a constant, where $s_n$
is again the number of non-zero coefficients of $\beta$. In the original version, a necessary
condition for the UUP is

$$\phi_{\min}(s_n) + \phi_{\min}(2 s_n) + \phi_{\min}(3 s_n) > 2.$$

In some sense, this requirement is weaker than an $s_n \log n$-incoherent design, as the minimal
eigenvalues are calculated over maximally $3 s_n$ instead of $s_n \log n$ variables. In another sense,
an $s_n \log n$-incoherent design is weaker, as the eigenvalue can be bounded from below by an
arbitrarily small constant.

Incoherent designs and the irrepresentable condition. One might ask in what sense
the notion of incoherent designs is more general than the irrepresentable condition. At first,
it might seem like we are simply replacing the strict irrepresentable condition
by a similarly strong condition on the design matrix.

Consider first the classical case of a fixed number $p$ of variables. If the covariance matrix
$C = C_n$ is converging to a positive definite matrix for large sample sizes, the design is
automatically incoherent. On the other hand, it is easy to violate the irrepresentable condition
in this case; for examples see Zou (2005).

The notion of incoherent designs is only a real restriction in the high-dimensional case
with $p_n > n$. Even then, it is clear that the notion of incoherence is a relaxation of the
irrepresentable condition, as the irrepresentable condition can easily be violated even though
all sparse eigenvalues are bounded well away from zero.

2.4 Main result for high-dimensional data ($p_n > n$)

Before we state our main result, we would like to use and explain the concept of active
variables (Osborne et al., 2000; Efron et al., 2004) so that the penalty parameter has a
useful interpretation as an upper bound on the number of active variables.

Active variables. Let $G^\lambda$ be the $p$-dimensional gradient vector with respect to $\beta$ of the
squared error loss, $G^\lambda = (Y - X\hat\beta^\lambda)^T X$, where $\hat\beta^\lambda$ is the Lasso estimator. It follows by the
KKT conditions or, alternatively, results in Osborne et al. (2000) or Efron et al. (2004) that
the maximum of the absolute values of the components of $G^\lambda = (G_1^\lambda, \ldots, G_p^\lambda)$ is bounded by $\lambda$,

$$\max_{1 \le k \le p} |G_k^\lambda| \le \lambda.$$

We call variables with maximal absolute value of the gradient active variables. The set of
active variables is denoted by

$$A_\lambda := \{k : |G_k^\lambda| = \lambda\}. \qquad (7)$$

The number of selected variables (variables with a non-zero coefficient) is at most as large as
the number of active variables, as any variable with a non-zero estimated coefficient has to
be an active variable (Osborne et al., 2000). In Lemma 2, we derive an upper probabilistic
bound on the number of active variables when setting the penalty parameter to $\lambda$. Set

$$m_\lambda := \sigma_y^2 \, \phi_{\max} \, \frac{n^2}{\lambda^2}, \qquad (8)$$

where $\phi_{\max} = \phi_{\max}(\min\{n, p\})$ is the bounded maximal eigenvalue of a selection of $\min\{n, p\}$
variables. Let $\lambda_n$ be a sequence of penalty parameters. Then, with probability converging to
1 for $n \to \infty$, we have that $|A_{\lambda_n}| \le m_{\lambda_n}$. Instead of the penalty parameter $\lambda$, we will often
use the equivalent value of $m_\lambda$, as it offers in our opinion the better intuition.
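
The relation between selected and active variables can be checked numerically in the simplest setting, a design with orthogonal columns of squared norm $n$, where the Lasso solution is available in closed form. This is a sketch under stated assumptions: the function names are hypothetical, the gradient is taken as the full derivative of the squared error in (1) up to sign (which carries a factor 2 that the notation in the text absorbs), and for this scaling the soft-threshold level is $\lambda/(2n)$.

```python
import numpy as np


def lasso_orthogonal(X, Y, lam):
    """Closed-form minimizer of ||Y - X b||^2 + lam ||b||_1 when X^T X = n I."""
    n = X.shape[0]
    ols = X.T @ Y / n                               # OLS coefficients
    return np.sign(ols) * np.maximum(np.abs(ols) - lam / (2.0 * n), 0.0)


def loss_gradient(X, Y, beta_hat):
    """Derivative of the squared error loss, up to sign: 2 X^T (Y - X beta_hat)."""
    return 2.0 * X.T @ (Y - X @ beta_hat)


def active_set(G, lam, tol=1e-9):
    """Indices whose gradient component has absolute value lam (up to tol)."""
    return np.flatnonzero(np.abs(np.abs(G) - lam) <= tol)


# Small illustrative design: three orthogonal +/-1 columns, n = 4.
X_demo = np.array([[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]], dtype=float)
Y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + 0.1 * X_demo[:, 1]
lam_demo = 2.0
beta_demo = lasso_orthogonal(X_demo, Y_demo, lam_demo)
G_demo = loss_gradient(X_demo, Y_demo, beta_demo)
selected_demo = np.flatnonzero(beta_demo != 0)
active_demo = active_set(G_demo, lam_demo)
```

In this example every selected variable is also active, and the largest absolute gradient component equals the penalty parameter, as the KKT conditions require.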


Main result. We will discuss the implications of the theorem after the proof, as its
interpretation might not be accessible at first.


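Theorem 1 (Convergence in $\ell_2$-norm) Let Assumptions 1-4 be satisfied and assume the
$s_n \log n$-incoherent design condition (6). Let $m_{\lambda_n}$ be the bound (8) on the number of active
variables under penalty parameter $\lambda_n$. The $\ell_2$-norm of the error is then bounded for $n \to \infty$
as

$$\|\beta - \hat\beta^{\lambda_n}\|_{\ell_2}^2 \le O_p\!\left(\frac{\log p_n}{n} \, \frac{m_{\lambda_n}}{\phi_{\min}(m_{\lambda_n})}\right) + O\!\left(\frac{s_n}{m_{\lambda_n}}\right). \qquad (9)$$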

A  proof  is  given  in  Section  3. 


Remark 1 It might be of interest to compare the results for variance and bias with equivalent
results for orthogonal designs. In the case of orthogonal designs, each OLS coefficient
$\hat\beta_k^0$, $k = 1, \ldots, p$, is soft-thresholded by the quantity $n^{-1}\lambda_n$ to get $\hat\beta_k^{\lambda_n}$. The squared bias of
coefficients with $|\beta_k| \gg n^{-1}\lambda_n$ is thus $n^{-2}\lambda_n^2$ (under the condition that $n$ is sufficiently large,
so that $|\hat\beta_k^0| > n^{-1}\lambda_n$ with high probability). The total squared bias is thus of order $s_n/m_{\lambda_n}$, which
is identical to the order we derive for incoherent designs, so the bound cannot be improved
in general.

The variance part can also be compared to the variance of an estimator for orthogonal
designs, which is proportional to $m_{\lambda_n}/n$, the number of selected parameters divided by the
sample size. In our result, we get an additional $\log p_n$ factor, which stems from the fact that
the subset of $m_{\lambda_n}$ variables can be chosen among $p_n$ variables. An additional factor of the
reciprocal of $\phi_{\min}(m_{\lambda_n})$ adjusts for correlated designs.

Remark  2  It  can  be  seen  from  the  proofs  that  non-asymptotic  bounds  can  be  obtained 
with  essentially  the  same  results.  As  the  constants  of  the  non-asymptotic  bounds  are  not 
very  tight,  we  choose  to  present  only  the  asymptotic  result  for  clarity  of  exposition. 

Organization of the remainder. The main question of concern for us is: under what
circumstances can we find a sequence $m_{\lambda_n}$ such that both the variance and bias term vanish
asymptotically? If such a sequence exists, then we know that there is a sequence of penalty
parameters $\lambda_n$ so that

$$\|\beta - \hat\beta^{\lambda_n}\|_{\ell_2} \to 0 \quad n \to \infty.$$

Sufficient conditions for $\ell_2$-consistency for $\ell_0$-sparse vectors are derived in the following
section. Thereafter results will be extended to vectors in weak $\ell_q$-balls. We will define the
notion of effective sparsity and show that there is a one-to-one correspondence between the
results for $\ell_0$-sparse vectors and vectors in weak $\ell_q$-balls.

Lastly, a major implication of the results is shown, namely that the Lasso can be tuned
to reliably pick all important variables while selecting a small subset of the total number of
variables. As already known, some unimportant variables will unfortunately also be included
in this set, but can be removed in a second stage.

2.5 $\ell_2$-consistency

We can immediately derive sufficient conditions for $\ell_2$-consistency in the sense of (4), asking
under what circumstances there exists a penalty parameter sequence $\lambda_n$ so that $\|\beta - \hat\beta^{\lambda_n}\|_{\ell_2}$
converges in probability to 0 for large values of the sample size $n$. The following corollary can
be derived from Theorem 1 by choosing $m_{\lambda_n} = s_n \log n$, using the incoherence assumption (6).

Corollary 1 ($\ell_2$-consistency) Let the assumptions of Theorem 1 be satisfied. The Lasso
estimator is $\ell_2$-consistent under the condition that

$$s_n \log p_n \, \frac{\log n}{n} \to 0 \quad n \to \infty. \qquad (10)$$

The result thus allows the number of relevant variables $s_n$ to grow almost as fast as the
number of samples $n$ if $p_n$ is not exponential in $n$, while still enjoying $\ell_2$-consistency.

Remark 3 To achieve the most general result for $\ell_2$-consistency, the rate $\lambda_n$ of the penalty
parameter has to depend on the unknown sparsity $s_n$. The results thus offer not so much
help in picking the correct penalty parameter, but merely state that somewhere along the
solution path (when varying $\lambda$) there is a solution close to the true vector. How to choose
the penalty parameter in a data-driven way is left for further research.

Under less general circumstances, we can achieve $\ell_2$-consistency with a fixed penalty parameter
sequence that does not depend on the unknown sparsity. Specifically, if the growth rates
of $s_n$ and $p_n$ are limited to $s_n \ll n^{\rho_1}$ and $n/\log p_n \gg n^{\rho_2}$ for some $0 < \rho_1 < \rho_2 < 1$,
then any sequence $m_{\lambda_n} \asymp n^{\rho}$ with $\rho_1 < \rho < \rho_2$ achieves $\ell_2$-consistency, irrespective of the
actual sparsity, as long as the stronger incoherence assumption $\liminf_{n \to \infty} \phi_{\min}(m_{\lambda_n}) > 0$ is
fulfilled.

2.6 Some results for weak $\ell_q$-balls

So far, we have been assuming that the vector $\beta$ is sparse in an $\ell_0$-sense, with most entries of
$\beta$ being identically zero. This is a conceptually simple assumption. It is easy to formulate the
problem and understand the results for the $\ell_0$-sparse setting. In the end, however, it might
be overly simplistic to assume that most of the entries are identically zero. It is perhaps
more interesting to assume that most of the entries are very small, as is the case for wavelet
coefficients of natural images (Joshi et al., 1995; LoPresto et al., 1997; Mallat, 1989). We
can for example consider the case that the vector $\beta$ lies in a weak $\ell_q$-ball with $0 < q < 2$. Let
$|\beta_{(1)}| \ge |\beta_{(2)}| \ge \ldots \ge |\beta_{(p)}|$ be the ordered entries of $\beta$. The vector $\beta$ lies in a weak $\ell_q$-ball if
there exists a constant $s_{q,n} > 0$ such that

$$\forall\, 1 \le k \le p : \quad |\beta_{(k)}| \le s_{q,n} \, k^{-1/q}. \qquad (11)$$

If a vector $\beta$ has an $\ell_q$-(quasi-)norm $\|\beta\|_{\ell_q}$, then it also lies in a weak $\ell_q$-ball with the sparsity
$s_{q,n}$, that is

$$s_{q,n} \le \|\beta\|_{\ell_q}.$$

In the $\ell_q$-sparse setting, it does not make sense to try to recover the correct sparsity pattern,
as all coefficients are in general different from zero. We can, however, ask if the most
important coefficients are recovered, neglecting coefficients with very small absolute value.
Consider the case $0 < q < 1$, where coefficients decay faster than $1/k$. As can be seen in the
following, the bound on the $\ell_2$-distance between $\beta$ and its estimate $\hat\beta^\lambda$ is very similar to the
$\ell_0$-sparse setting.

Effective Sparsity. There is a simple connection between the results for $\ell_0$-sparse and
$\ell_q$-sparse vectors. Specifically, we are interested in settings where $\ell_2$-consistency of the Lasso
estimator can be achieved. If we define the effective sparsity as $s_{q,n}$ raised to the power of
$2q/(2 - q)$, the results of the $\ell_0$-sparse setting are directly applicable to the $\ell_q$-sparse setting
with $0 < q < 1$.
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Definition  5  (Effective  sparsity) 

with  sparsity  sqtU  is  defined  as 


The  effective  sparsity  sff  of  a  vector  (3  in  a  weak  £q-ball 


2q 


2-9 

q,n 


(12) 


To motivate the notion of effective sparsity, suppose that the decay of the absolute value of the components of $\beta$ is fast. A good approximation to $\beta$ in the $\ell_2$-sense can then be obtained by retaining just a few large components of $\beta$. Assume that the entries of $\beta$ are ordered so that $|\beta_1| \ge |\beta_2| \ge \dots \ge |\beta_p|$. Let $\beta^d$ be the approximation that retains only the $d$ largest components,

$$\beta^d_k = \begin{cases} \beta_k & k \le d \\ 0 & k > d. \end{cases}$$

The effective sparsity measures the minimal number $d = d_n$ of non-zero components necessary to obtain an approximation $\beta^{d_n}$ of $\beta$ that satisfies

$$\|\beta^{d_n} - \beta\|_{\ell_2} \to 0, \qquad n \to \infty. \qquad (13)$$


To be precise, let $B$ be the set $B = \{\beta : \|\beta\|_{\ell_q} \le s_{q,n} \text{ for all } n\}$. Then, for any sequence $d_n$ which satisfies (13) for every vector $\beta \in B$, the number of retained coefficients $d_n$ needs to be at least of the same order as the effective sparsity, that is $\liminf_{n\to\infty} d_n / s^{\mathrm{eff}}_{q,n} > 0$. On the other hand, retaining exactly $s^{\mathrm{eff}}_{q,n}$ components satisfies (13) for every vector $\beta \in B$. A proof of this is straightforward but omitted here. The notion of effective sparsity is thus inherent to the nature of the problem.
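As a small numerical illustration of these notions, the sketch below computes the effective sparsity (12) and the squared $\ell_2$-error of the best $d$-term approximation for a vector on the boundary of a weak $\ell_q$-ball. The helper names and the values of $q$, $s_{q,n}$ and $p$ are assumptions chosen only for illustration.

```python
import numpy as np

def effective_sparsity(s_q, q):
    """Effective sparsity s_q ** (2q / (2 - q)), cf. (12)."""
    return s_q ** (2.0 * q / (2.0 - q))

def tail_error_sq(beta, d):
    """Squared l2-error when only the d largest entries (in absolute value) are kept."""
    mags = np.sort(np.abs(beta))[::-1]
    return float(np.sum(mags[d:] ** 2))

q, s_q, p = 0.5, 4.0, 10_000                  # hypothetical values
k = np.arange(1, p + 1)
beta = s_q * k ** (-1.0 / q)                  # extremal vector: equality in (11) for every k

s_eff = effective_sparsity(s_q, q)
errors = {d: tail_error_sq(beta, d) for d in (1, 10, 100)}
print(s_eff, errors)
```

The approximation error decreases quickly as more of the largest components are retained, which is the behaviour that the effective sparsity is meant to summarize.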

The definition of effective sparsity will be helpful in the following. Suppose we want to see whether $\ell_2$-consistency can be achieved for vectors in a weak $\ell_q$-ball with sparsity $s_{q,n}$. The effective sparsity of this setting can then be calculated according to (12). We can then look at the $\ell_0$-sparse vectors, where the number of non-zero entries is set to the effective sparsity of the original problem. If $\ell_2$-consistency can be achieved for the $\ell_0$-sparse setting, it can also be achieved for the $\ell_q$-sparse setting. With the notion of effective sparsity, a bound on the $\ell_2$-distance between the Lasso estimator and the true vector can be derived.


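Theorem 2 Let the assumptions of Theorem 1 be fulfilled, except that the vector $\beta$ is only assumed to be in a weak $\ell_q$-ball for some $0 < q < 1$. Let again $m_{\lambda_n}$ be the bound on the number of active variables (8) and assume $m_{\lambda_n}$-incoherent design. Then

$$\|\hat\beta^{\lambda_n} - \beta\|^2_{\ell_2} \le O_p\Big(\frac{\log p_n}{n}\,\frac{m_{\lambda_n}}{\phi^2_{\min}(m_{\lambda_n})}\Big) + O\Big(\Big(\frac{s^{\mathrm{eff}}_{q,n}}{m_{\lambda_n}}\Big)^{1-q/2}\Big).$$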


A  proof  is  given  in  Section  4. 

The place of the sparsity measure $s_n$, the number of non-zero elements, is now taken by the effective sparsity $s^{\mathrm{eff}}_{q,n}$. The implications of Theorem 2 are similar to those for $\ell_0$-sparse vectors. The Lasso is thus able to recover not only $\ell_0$-sparse vectors but also vectors which are sparse in the sense of lying in a weak $\ell_q$-ball for some small value of $q$.

2.7 Sign consistency with two-step procedures

The results show that the Lasso estimator can be made sign consistent in a two-step procedure even if the irrepresentable condition is relaxed, but under the assumption that non-zero coefficients of $\beta$ are "sufficiently" large. One possibility is hard-thresholding of the obtained coefficients, neglecting variables with very small coefficients. This effect has already been observed empirically in (Valdes-Sosa et al., 2005). Other possibilities include soft-thresholding and relaxation methods such as the Gauss-Dantzig selector (Candes and Tao, 2005b) or the Relaxed Lasso (Meinshausen, 2006) with an additional thresholding step.

We start with a corollary that follows directly from Theorem 1, stating that important variables are chosen with high probability. Let $L$ be the subset of large coefficients and $Z$ be the subset of zero coefficients,

$$L := \{k : |\beta_k| \gg (n^{-1} s_n \log p_n)^{1/4}\},$$

$$Z := \{k : \beta_k = 0\},$$

where $a_n \gg b_n$ is again meant to imply that $a_n/b_n \to \infty$ for $n \to \infty$. The following corollary states that the Lasso can distinguish between variables in $L$ and $Z$.

Corollary 2 Let the assumptions of Theorem 2 be satisfied and assume $\sqrt{s_n n}$-incoherent design. There exists a penalty sequence $\lambda_n$ such that, with probability converging to 1 for $n \to \infty$, the absolute value of coefficients in $L$ is larger than the absolute value of any coefficient in $Z$,

$$\forall\, k_1 \in L,\ k_2 \in Z: \quad |\hat\beta^{\lambda_n}_{k_1}| > |\hat\beta^{\lambda_n}_{k_2}|. \qquad (14)$$

Proof. The proof follows from the results of Theorem 1. Event (14) is fulfilled if $\|\hat\beta^{\lambda_n} - \beta\|_{\ell_\infty} < \tfrac{1}{2}\min_{k \in L} |\beta_k|$. Choosing a penalty sequence $\lambda_n$ with $m_{\lambda_n} = \sqrt{n s_n \log^{-1} p_n}$ yields with Theorem 1 that $\|\hat\beta^{\lambda_n} - \beta\|^2_{\ell_2} \le O_p(\sqrt{n^{-1} s_n \log p_n})$. The bound on the $\ell_2$-distance gives trivially the identical bound on the $\ell_\infty$-distance between $\hat\beta^{\lambda_n}$ and $\beta$. Furthermore, by definition of the set $L$, $\min_{k \in L} |\beta_k| \gg (n^{-1} s_n \log p_n)^{1/4}$, which completes the proof. □


Remark 4 The corollary implies that variables with sufficiently large regression coefficients are chosen with very large probability by the Lasso, that is for $n \to \infty$,

$$P(\forall\, k \in L : \hat\beta^{\lambda_n}_k \ne 0) \to 1.$$

As some additional unwanted variables are also chosen (which cannot be avoided if the irrepresentable condition is violated), the result implies that the Lasso is successful in narrowing down the choice of $p_n \gg n$ variables to a subset of variables with cardinality much smaller than $n$ (at least of smaller order than $\sqrt{n s_n}$). All important variables are with large probability in this much smaller Lasso-selected subset. A two-step procedure would try to filter out those important variables from the selected subset. Consistent variable selection could for example be achieved by simple thresholding of small coefficients in the initial Lasso estimator.

Remark 5 It is apparent that the large bias of the Lasso estimator allows only for a slow rate of decay of coefficients in the set $L$. To alleviate this problem, one could first reduce the bias of the selected coefficients and apply thresholding after this relaxation step. Even though we view this bias-reduction step as important, we refrain from giving more details due to space constraints.

In  conclusion,  even  though  one  cannot  achieve  sign  consistency  in  general  with  just  a  single 
Lasso  estimation,  it  can  often  be  achieved  in  a  two-stage  procedure. 
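A two-step procedure of this kind can be sketched as follows: a Lasso fit followed by hard-thresholding of small coefficients. This is only an illustration under stated assumptions. The coordinate descent solver `lasso_cd`, the objective scaling $\tfrac{1}{2}\|Y - X\beta\|^2_{\ell_2} + \lambda\|\beta\|_{\ell_1}$, the simulated design and the values of `lam` and `tau` are all choices made here and are not taken from the original text.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=300):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1 (assumed scaling)."""
    b = np.zeros(X.shape[1])
    col_sq = np.sum(X ** 2, axis=0)
    r = y.astype(float).copy()                     # residual y - X b
    for _ in range(n_sweeps):
        for k in range(X.shape[1]):
            rho = X[:, k] @ r + col_sq[k] * b[k]   # correlation with the partial residual
            b_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[k]
            r += X[:, k] * (b[k] - b_new)          # keep the residual in sync
            b[k] = b_new
    return b

def hard_threshold(b, tau):
    """Second step: set coefficients with absolute value below tau to zero."""
    out = b.copy()
    out[np.abs(out) < tau] = 0.0
    return out

rng = np.random.default_rng(0)
n, p = 100, 200                                    # more variables than samples
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                        # three "large" coefficients
y = X @ beta + 0.1 * rng.standard_normal(n)

b_lasso = lasso_cd(X, y, lam=10.0)                 # step 1: Lasso
b_two_step = hard_threshold(b_lasso, tau=0.3)      # step 2: thresholding
selected = np.flatnonzero(b_two_step)
print(selected)
```

In practice the threshold would have to be chosen in a data-dependent way; the fixed value used here only serves to show the mechanics of the second step.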

3  Proof  of  Theorem  1 

The first term on the right hand side of (9) is a variance-type contribution and the second term represents a bias-type contribution. Let $\beta^\lambda$ be the estimator in the absence of noise, that is $\beta^\lambda = \hat\beta^{\lambda,0}$, where $\hat\beta^{\lambda,\xi}$ is defined as in (30). The $\ell_2$-distance can then be bounded by $\|\hat\beta^\lambda - \beta\|^2_{\ell_2} \le 2\|\hat\beta^\lambda - \beta^\lambda\|^2_{\ell_2} + 2\|\beta^\lambda - \beta\|^2_{\ell_2}$. The first term on the right hand side represents the variance of the estimation, while the second term represents the bias. The bound on the variance term follows by Lemma 6. The bias contribution follows directly from Lemma 1. □


3.1  Part  I  of  Proof:  Bias 

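Let $K$ be the set of non-zero elements of $\beta$, that is $K = \{k : \beta_k \ne 0\}$. The cardinality of $K$ is again denoted by $s = s_n$. For the following, let $\beta^\lambda$ be the estimator $\hat\beta^\lambda$ in the absence of noise, $\sigma = 0$. The solution $\beta^\lambda$ can, for each value of $\lambda$, be written as $\beta^\lambda = \beta + \gamma^\lambda$, where

$$\gamma^\lambda = \operatorname{argmin}_{\zeta \in \mathbb{R}^p} f(\zeta), \qquad (15)$$

where the function $f(\zeta)$ is given by

$$f(\zeta) = n\,\zeta^T C \zeta + \lambda \sum_{k \in K^c} |\zeta_k| + \lambda \sum_{k \in K} (|\beta_k + \zeta_k| - |\beta_k|). \qquad (16)$$

The vector $\gamma^\lambda$ is the bias of the Lasso estimator. We derive first a bound on the $\ell_2$-norm of $\gamma^\lambda$.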


Lemma 1 Let $\kappa > 0$ be such that $\liminf_{n\to\infty} \phi_{\min}(s_n \log n) \ge \kappa$. The $\ell_2$-norm of $\gamma^\lambda$, as defined in (15), is bounded for sufficiently large values of $n$ by

$$\|\gamma^\lambda\|_{\ell_2} \le \frac{2\lambda\sqrt{s_n}}{n\kappa}.$$

Proof. We write in the following $\gamma$ instead of $\gamma^\lambda$ for notational simplicity. Let $\gamma(K)$ be the vector with coefficients $\gamma_k(K) = \gamma_k 1\{k \in K\}$, that is $\gamma(K)$ is the bias of the truly non-zero coefficients. Analogously, let $\gamma(K^c)$ be the bias of the truly zero coefficients with $\gamma_k(K^c) = \gamma_k 1\{k \notin K\}$. Clearly, $\gamma = \gamma(K) + \gamma(K^c)$. The value of the function $f(\zeta)$, as defined in (16), is 0 if setting $\zeta = 0$. For the true solution $\gamma^\lambda$, it follows hence that $f(\gamma^\lambda) \le 0$. Hence, using that $\zeta^T C \zeta \ge 0$ for any $\zeta$,

$$\|\gamma(K^c)\|_{\ell_1} = \sum_{k \in K^c} |\gamma_k| \le \Big|\sum_{k \in K} (|\beta_k + \gamma_k| - |\beta_k|)\Big| \le \|\gamma(K)\|_{\ell_1}. \qquad (17)$$

As $\|\gamma(K)\|_{\ell_0} \le s_n$, it follows that $\|\gamma(K)\|_{\ell_1} \le \sqrt{s_n}\,\|\gamma(K)\|_{\ell_2} \le \sqrt{s_n}\,\|\gamma\|_{\ell_2}$ and hence, using (17),

$$\|\gamma\|_{\ell_1} \le 2\sqrt{s_n}\,\|\gamma\|_{\ell_2}. \qquad (18)$$

This result will be used further below. We use now again that $f(\gamma^\lambda) \le 0$ (as $\zeta = 0$ yields the upper bound $f(\zeta) = 0$). Using the previous result that $\|\gamma(K)\|_{\ell_1} \le \sqrt{s_n}\,\|\gamma\|_{\ell_2}$, and ignoring the non-negative term $\|\gamma(K^c)\|_{\ell_1}$, it follows that

$$n\,\gamma^T C \gamma \le \lambda \sqrt{s_n}\,\|\gamma\|_{\ell_2}. \qquad (19)$$

Consider now the term $\gamma^T C \gamma$. Bounding this term from below and plugging the result into (19) will yield the desired upper bound on the $\ell_2$-norm of $\gamma$. Let $|\gamma_{(1)}| \ge |\gamma_{(2)}| \ge \dots \ge |\gamma_{(p)}|$ be the ordered entries of $\gamma$.

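Let $\{u_n\}_{n \in \mathbb{N}}$ be a sequence of positive integers, to be chosen later, and define the set of the "$u_n$-largest coefficients" as $U = \{k : |\gamma_k| \ge |\gamma_{(u_n)}|\}$. Define analogously to above the vectors $\gamma(U)$ and $\gamma(U^c)$ by $\gamma_k(U) = \gamma_k 1\{k \in U\}$ and $\gamma_k(U^c) = \gamma_k 1\{k \notin U\}$. The quantity $\gamma^T C \gamma$ can be written as

$$\gamma^T C \gamma = (\gamma(U) + \gamma(U^c))^T C\, (\gamma(U) + \gamma(U^c)) = \|a + b\|^2_{\ell_2}, \qquad (20)$$

where $a := n^{-1/2} X \gamma(U)$ and $b := n^{-1/2} X \gamma(U^c)$. Then

$$\gamma^T C \gamma = a^T a + 2 b^T a + b^T b \ge \|a\|^2_{\ell_2} - 2\|a\|_{\ell_2}\|b\|_{\ell_2}. \qquad (21)$$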

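As $\gamma(U)$ has by definition only $u_n$ non-zero coefficients,

$$\|a\|^2_{\ell_2} = \gamma(U)^T C \gamma(U) \ge \phi_{\min}(u_n)\,\|\gamma(U)\|^2_{\ell_2} = \phi_{\min}(u_n)\,(\|\gamma\|^2_{\ell_2} - \|\gamma(U^c)\|^2_{\ell_2}). \qquad (22)$$

As $\gamma(U^c)$ has at most $n$ non-zero coefficients,

$$\|b\|^2_{\ell_2} = \gamma(U^c)^T C \gamma(U^c) \le \phi_{\max}(n)\,\|\gamma(U^c)\|^2_{\ell_2}. \qquad (23)$$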

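Using (22) and (23) in (21),

$$\|a\|^2_{\ell_2} - 2\|a\|_{\ell_2}\|b\|_{\ell_2} \ge \phi_{\min}(u_n)\,(\|\gamma\|^2_{\ell_2} - \|\gamma(U^c)\|^2_{\ell_2}) - 2\sqrt{\phi_{\min}(u_n)\,\phi_{\max}(n)}\,\|\gamma\|_{\ell_2}\,\|\gamma(U^c)\|_{\ell_2},$$

and hence

$$\gamma^T C \gamma \ge \phi_{\min}(u_n)\,\|\gamma\|^2_{\ell_2} \Big(1 - 2\sqrt{\frac{\phi_{\max}(n)}{\phi_{\min}(u_n)}}\,\frac{\|\gamma(U^c)\|_{\ell_2}}{\|\gamma\|_{\ell_2}} - \frac{\|\gamma(U^c)\|^2_{\ell_2}}{\|\gamma\|^2_{\ell_2}}\Big). \qquad (24)$$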


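Before proceeding, we need to bound the norm $\|\gamma(U^c)\|_{\ell_2}$ as a function of $u_n$. Assume for the moment that the $\ell_1$-norm $\|\gamma\|_{\ell_1}$ is identical to some $\ell > 0$. Then it holds for every $k = 1, \dots, p$ that $|\gamma_{(k)}| \le \ell/k$. Hence,

$$\|\gamma(U^c)\|^2_{\ell_2} \le \|\gamma\|^2_{\ell_1} \sum_{k = u_n + 1}^{p} \frac{1}{k^2} \le \frac{\|\gamma\|^2_{\ell_1}}{u_n} \le \frac{4 s_n}{u_n}\,\|\gamma\|^2_{\ell_2}, \qquad (25)$$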


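having used the result (18) from above that $\|\gamma\|_{\ell_1} \le 2\sqrt{s_n}\,\|\gamma\|_{\ell_2}$. Plugging this result into (24),

$$\gamma^T C \gamma \ge \phi_{\min}(u_n)\,\|\gamma\|^2_{\ell_2} \Big(1 - 4\sqrt{\frac{\phi_{\max}(n)\, s_n}{\phi_{\min}(u_n)\, u_n}} - \frac{4 s_n}{u_n}\Big). \qquad (26)$$

Choosing a sequence $u_n = s_n \log n$, it holds with the assumption $\liminf_{n\to\infty} \phi_{\min}(s_n \log n) \ge \kappa$ and the assumption of a bounded maximal eigenvalue $\phi_{\max}(n)$ that, for sufficiently large values of $n$,

$$\gamma^T C \gamma \ge \frac{\kappa}{2}\,\|\gamma\|^2_{\ell_2}.$$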


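Using the last result together with (19), which says that $\gamma^T C \gamma \le n^{-1}\lambda\sqrt{s_n}\,\|\gamma\|_{\ell_2}$, it follows that for large $n$,

$$\|\gamma\|_{\ell_2} \le \frac{2\lambda\sqrt{s_n}}{n\kappa},$$

which completes the proof. □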


3.2  Part  II  of  Proof:  Variance 

First,  bounds  for  the  number  of  selected  and  active  variables  are  derived.  These  bounds  are 
later  used  to  assess  the  variance  of  the  estimator  under  noisy  observations. 

A bound on the number of active variables. A decisive part in the variance of the estimator is determined by the number of selected variables. Instead of directly bounding the number of selected variables, we derive bounds for the number of active variables. As any variable with a non-zero regression coefficient is also an active variable, these bounds lead trivially to bounds for the number of selected variables.

Let again $\mathcal{A}_\lambda$ be the set of active variables,

$$\mathcal{A}_\lambda = \{k : |G^\lambda_k| = \lambda\}.$$

The number of selected variables (variables with a non-zero coefficient) is at most as large as the number of active variables, as any variable with a non-zero estimated coefficient has to be an active variable (Osborne et al., 2000).
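The relation between selected and active variables can be checked numerically on a small example. The sketch below assumes the scaling $\tfrac{1}{2}\|Y - X\beta\|^2_{\ell_2} + \lambda\|\beta\|_{\ell_1}$ of the objective, under which the vector $X^T(Y - X\hat\beta^\lambda)$ plays the role of the gradient and active variables are those where its absolute value equals $\lambda$. The solver `lasso_cd` and the simulated data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=2000):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1 (assumed scaling)."""
    b = np.zeros(X.shape[1])
    col_sq = np.sum(X ** 2, axis=0)
    r = y.astype(float).copy()                     # residual y - X b
    for _ in range(n_sweeps):
        for k in range(X.shape[1]):
            rho = X[:, k] @ r + col_sq[k] * b[k]
            b_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[k]
            r += X[:, k] * (b[k] - b_new)
            b[k] = b_new
    return b

rng = np.random.default_rng(0)
n, p, lam = 50, 20, 5.0
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 1.0
y = X @ beta + 0.5 * rng.standard_normal(n)

b_hat = lasso_cd(X, y, lam)
grad = X.T @ (y - X @ b_hat)                       # gradient-type vector at the solution
selected = np.flatnonzero(b_hat != 0)              # non-zero coefficients
active = np.flatnonzero(np.abs(grad) >= lam - 1e-6)  # |gradient| equal to lam (up to tolerance)
print(len(selected), len(active))
```

Every selected variable appears in the active set, so the number of active variables is an upper bound on the number of selected variables.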


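Lemma 2 With probability tending to 1 for $n \to \infty$, the number $|\mathcal{A}_\lambda|$ of active variables of the estimator $\hat\beta^\lambda$ is bounded by

$$|\mathcal{A}_\lambda| \le \sigma_y^2\,\phi_{\max}(|\mathcal{A}_\lambda|)\,\frac{n^2}{\lambda^2} \le \sigma_y^2\,\phi_{\max}(\min\{n,p\})\,\frac{n^2}{\lambda^2} := m_\lambda,$$

where $\phi_{\max}(|\mathcal{A}_\lambda|)$ is the bounded maximal eigenvalue of a selection of at most $|\mathcal{A}_\lambda| \le \min\{n,p\}$ variables.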

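Proof. Let $R(\lambda)$ be the vector of residuals, $R(\lambda) = Y - X\hat\beta^\lambda$. For any variable $k$ in the set $\mathcal{A}_\lambda$ of active variables,

$$|G^\lambda_k| = |R^T(\lambda) X_k| = \lambda. \qquad (27)$$

Let $R_{\mathcal{A}}(\lambda)$ be the projection $P_{\mathcal{A}} R(\lambda)$ of the residuals $R(\lambda)$ into the $|\mathcal{A}_\lambda|$-dimensional space spanned by the $|\mathcal{A}_\lambda|$ active variables. Then, by (27),

$$\|X_{\mathcal{A}}^T R_{\mathcal{A}}(\lambda)\|^2_{\ell_2} = \|X_{\mathcal{A}}^T R(\lambda)\|^2_{\ell_2} = |\mathcal{A}_\lambda|\,\lambda^2. \qquad (28)$$

As $R_{\mathcal{A}}(\lambda)$ is the projection onto the space spanned by $|\mathcal{A}_\lambda|$ active variables, it holds that for $|\mathcal{A}_\lambda| \le n$, with the notation $v = X_{\mathcal{A}}^T R(\lambda)$,

$$R_{\mathcal{A}}(\lambda) = X_{\mathcal{A}} (X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1} v,$$

and hence

$$\|R_{\mathcal{A}}(\lambda)\|^2_{\ell_2} = v^T (X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1} v \ge \{n\,\phi_{\max}(|\mathcal{A}_\lambda|)\}^{-1}\,\|v\|^2_{\ell_2}.$$

Using the result (28), it follows that $\|v\|^2_{\ell_2} \ge |\mathcal{A}_\lambda|\,\lambda^2$ and hence

$$\|R_{\mathcal{A}}(\lambda)\|^2_{\ell_2} \ge \{n\,\phi_{\max}(|\mathcal{A}_\lambda|)\}^{-1}\,|\mathcal{A}_\lambda|\,\lambda^2.$$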

The sum of squared residuals is bounded uniformly over all subsets $M$ of the active variables by $\|R_{\mathcal{A}}(\lambda)\|^2_{\ell_2} \le \|Y\|^2_{\ell_2}$. By assumption 3, it holds with probability converging to 1 for $n \to \infty$ that $\|Y\|^2_{\ell_2} \le n\sigma_y^2$, which completes the proof. □

De-noised response. We need for the following a little extension of the result above. Define for $0 \le \xi \le 1$ the de-noised version of the response variable,

$$Y(\xi) = X\beta + \xi\varepsilon. \qquad (29)$$

We can regulate the amount of noise with the parameter $\xi$. For $\xi = 0$, only the signal is retained. The original observations with the full amount of noise are recovered for $\xi = 1$. Now consider for $0 \le \xi \le 1$ the estimator $\hat\beta^{\lambda,\xi}$,

$$\hat\beta^{\lambda,\xi} = \operatorname{argmin}_{\beta} \|Y(\xi) - X\beta\|^2_{\ell_2} + \lambda\|\beta\|_{\ell_1}. \qquad (30)$$

The ordinary Lasso estimate is recovered under the full amount of noise so that $\hat\beta^{\lambda,1} = \hat\beta^\lambda$. Using the notation from the previous results, we can write for the estimate in the absence of noise, $\hat\beta^{\lambda,0} = \beta^\lambda$. The definition of the de-noised version of the Lasso estimator will be helpful for the proof as it allows us to characterize the variance of the estimator.

Number of active variables for the de-noised estimator. In analogy to the gradient $G^\lambda$ of the loss function, let $G^{\lambda,\xi}$ be the gradient vector with respect to $\beta$ of the squared error loss when estimating the de-noised version of the observations,

$$G^{\lambda,\xi} = (Y(\xi) - X\hat\beta^{\lambda,\xi})^T X. \qquad (31)$$

Variables are called again active if the absolute value of the respective gradient is equal to the maximal value $\lambda$. The active variables are denoted by $\mathcal{A}_{\lambda,\xi}$. The result of Lemma 2 can now easily be shown to hold uniformly over all de-noised versions of the estimator.

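Lemma 3 With probability converging to 1 for $n \to \infty$, the number $|\mathcal{A}_{\lambda,\xi}|$ of active variables of the de-noised estimator is bounded by

$$|\mathcal{A}_{\lambda,\xi}| \le \frac{n^2}{\lambda^2}\,\sigma_y^2\,\phi_{\max}(|\mathcal{A}_{\lambda,\xi}|).$$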

Proof. The proof follows analogously to the proof of Lemma 2. In analogy to $R_{\mathcal{A}}$, let $R_{\mathcal{A},\xi}$ be the projection of the residuals $Y(\xi) - X\hat\beta^{\lambda,\xi}$ onto the space spanned by the active variables. The bound $\|R_{\mathcal{A},\xi}\|^2_{\ell_2} \le \sup_{0 \le \xi \le 1} \|Y(\xi)\|^2_{\ell_2}$ holds uniformly over all values of $\xi$ with $0 \le \xi \le 1$. By assumption 3, the term $n^{-1}\sup_{0 \le \xi \le 1} \|Y(\xi)\|^2_{\ell_2}$ is, with probability converging to 1 for $n \to \infty$, bounded by $\sigma_y^2$. The proof follows then exactly like the proof of Lemma 2. □

This uniform bound is used below to bound the variance of the Lasso estimator.

Variance of restricted OLS. Next, we consider the variance of the Lasso estimator as a function of the penalty parameter. Let $\theta^M \in \mathbb{R}^p$ be, for every subset $M \subseteq \{1, \dots, p\}$ with $|M| \le n$, the restricted OLS-estimator of the noise vector $\varepsilon$,

$$\theta^M = (X_M^T X_M)^{-1} X_M^T \varepsilon. \qquad (32)$$

First, we bound the $\ell_2$-norm of this estimator. The result is useful for bounding the variance of the final estimator, based on the derived bound on the number of active variables.


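Lemma 4 The $\ell_2$-norm of the restricted estimator $\theta^M$ is, uniformly over all sets $M$ with $|M| \le m_{\lambda_n}$, where $m_{\lambda_n}$ is as defined in (8), bounded for $n \to \infty$ by

$$\max_{M : |M| \le m_{\lambda_n}} \|\theta^M\|^2_{\ell_2} = O_p\Big(\frac{\log p_n}{n}\,\frac{m_{\lambda_n}}{\phi^2_{\min}(m_{\lambda_n})}\Big).$$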


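Proof. It follows directly from the definition of $\theta^M$ that, for every $M$ with $|M| \le m_{\lambda_n}$,

$$\|\theta^M\|^2_{\ell_2} \le \frac{1}{n^2\,\phi^2_{\min}(m_{\lambda_n})}\,\|X_M^T \varepsilon\|^2_{\ell_2}. \qquad (33)$$

It remains to be shown that, for $n \to \infty$,

$$\max_{M : |M| \le m_{\lambda_n}} \|X_M^T \varepsilon\|^2_{\ell_2} = O_p(m_{\lambda_n}\, n \log p_n).$$

As $E(\exp|\varepsilon_i|) < \infty$ and $\max_k \|X_k\|_{\ell_\infty} < \infty$, it follows by Bernstein's inequality that $|X_k^T \varepsilon|^2 = O_p(n)$ for every $k \le p_n$ and, due to the exponential tail bound,

$$\max_{k \le p_n} |X_k^T \varepsilon|^2 = O_p(n \log p_n),$$

and hence,

$$\max_{M : |M| \le m_{\lambda_n}} \|X_M^T \varepsilon\|^2_{\ell_2} \le m_{\lambda_n} \max_{k \le p_n} |X_k^T \varepsilon|^2 = O_p(n\, m_{\lambda_n} \log p_n).$$

Using this in conjunction with (33) completes the proof. □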
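The order $n \log p_n$ of the maximal squared inner product between predictor variables and noise can be illustrated by simulation. The sketch below is not part of the proof. It assumes Gaussian noise and a Gaussian design with columns scaled to squared norm $n$, which is only one convenient setting; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 1000                                   # hypothetical dimensions
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)        # scale columns so that ||X_k||^2 = n
eps = rng.standard_normal(n)                       # standard Gaussian noise (assumption)

max_sq = float(np.max((X.T @ eps) ** 2))           # max_k |X_k^T eps|^2
ratio = max_sq / (n * np.log(p))                   # compare with the order n log p
print(ratio)
```

The ratio stays of constant order in such simulations, in line with the $O_p(n \log p_n)$ statement used above.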


Variance of estimate is bounded by restricted OLS variance. We show that the variance of the Lasso estimator can be bounded by the variances of restricted OLS estimators, using bounds on the number of active variables.

Lemma 5 If, for a fixed value of $\lambda$, the number of active variables of the de-noised estimators $\hat\beta^{\lambda,\xi}$ is for every $0 \le \xi \le 1$ bounded by $m$, then

$$\|\hat\beta^{\lambda,0} - \hat\beta^{\lambda,1}\|^2_{\ell_2} \le \max_{M : |M| \le m} \|\theta^M\|^2_{\ell_2}.$$

Proof. The key in the proof is that the solution path of $\hat\beta^{\lambda,\xi}$, if increasing the value of $\xi$ from 0 to 1, can be expressed piecewise in terms of the restricted OLS solution. The set $M(\xi)$ of active variables is the set with maximal absolute gradient,

$$M(\xi) = \{k : |G^{\lambda,\xi}_k| = \lambda\}.$$

Note that the estimator $\hat\beta^{\lambda,\xi}$ and also the gradient $G^{\lambda,\xi}$ are continuous functions in both $\lambda$ and $\xi$ (Efron et al., 2004). Let $0 = \xi_1 < \xi_2 < \dots < \xi_{J+1} = 1$ be the points of discontinuity of $M(\xi)$. At these locations, variables either join the active set or are dropped from the active set.

Fix some $j$ with $1 \le j \le J$. Denote by $M_j$ the set of active variables $M(\xi)$ for any $\xi \in (\xi_j, \xi_{j+1})$. We show in the following that the solution $\hat\beta^{\lambda,\xi}$ is for all $\xi$ in the interval $(\xi_j, \xi_{j+1})$ given by

$$\forall\, \xi \in (\xi_j, \xi_{j+1}): \quad \hat\beta^{\lambda,\xi} = \hat\beta^{\lambda,\xi_j} + (\xi - \xi_j)\,\theta^{M_j}, \qquad (34)$$

where $\theta^{M_j}$ is the restricted OLS estimator of noise, as defined in (32). The local effect of increased noise (larger value of $\xi$) on the estimator is thus to shift the coefficients of the active set of variables along the least squares direction.

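Once (34) is shown, the claim follows by piecing together the piecewise linear parts and using continuity of the solution as a function of $\xi$ to obtain

$$\|\hat\beta^{\lambda,0} - \hat\beta^{\lambda,1}\|_{\ell_2} \le \sum_{j=1}^{J} \|\hat\beta^{\lambda,\xi_j} - \hat\beta^{\lambda,\xi_{j+1}}\|_{\ell_2} \le \max_{M : |M| \le m} \|\theta^M\|_{\ell_2} \sum_{j=1}^{J} (\xi_{j+1} - \xi_j) = \max_{M : |M| \le m} \|\theta^M\|_{\ell_2}.$$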


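It thus remains to show (34). A necessary and sufficient condition for $\hat\beta^{\lambda,\xi}$ with $\xi \in (\xi_j, \xi_{j+1})$ to be a valid solution is that for all $k \in M_j$ with non-zero coefficient $\hat\beta^{\lambda,\xi}_k \ne 0$, the gradient is equal to $\lambda$ times the negative sign,

$$G^{\lambda,\xi}_k = -\lambda\,\mathrm{sign}(\hat\beta^{\lambda,\xi}_k), \qquad (35)$$

that for all variables $k \in M_j$ with zero coefficient $\hat\beta^{\lambda,\xi}_k = 0$ the gradient is equal in absolute value to $\lambda$,

$$|G^{\lambda,\xi}_k| = \lambda, \qquad (36)$$

and for variables $k \notin M_j$ not in the active set,

$$|G^{\lambda,\xi}_k| < \lambda. \qquad (37)$$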


These conditions are a consequence of the requirement that the subgradient of the loss function contains 0 for a valid solution.

Note that the gradient of the active variables in $M_j$ is unchanged if replacing $\xi \in (\xi_j, \xi_{j+1})$ by some $\xi' \in (\xi_j, \xi_{j+1})$ and replacing $\hat\beta^{\lambda,\xi}$ by $\hat\beta^{\lambda,\xi} + (\xi' - \xi)\,\theta^{M_j}$. That is, for all $k \in M_j$,

$$(Y(\xi) - X\hat\beta^{\lambda,\xi})^T X_k = \{Y(\xi') - X(\hat\beta^{\lambda,\xi} + (\xi' - \xi)\,\theta^{M_j})\}^T X_k,$$

as the difference of both sides is equal to

$$(\xi' - \xi)\,(\varepsilon - X\theta^{M_j})^T X_k,$$

and $(\varepsilon - X\theta^{M_j})^T X_k = 0$ for all $k \in M_j$, as $\theta^{M_j}$ is the OLS estimate of $\varepsilon$, regressed on the variables in $M_j$. Equalities (35) and (36) are thus fulfilled for the solution and it remains to show that (37) also holds. For sufficiently small values of $\xi' - \xi$, inequality (37) is clearly fulfilled for continuity reasons. Note that if $|\xi' - \xi|$ is large enough such that for one variable $k \notin M_j$ inequality (37) becomes an equality, then the set of active variables changes and thus either $\xi' = \xi_{j+1}$ or $\xi' = \xi_j$. We have thus shown that the solution $\hat\beta^{\lambda,\xi}$ can for all $\xi \in (\xi_j, \xi_{j+1})$ be written as

$$\hat\beta^{\lambda,\xi} = \hat\beta^{\lambda,\xi_j} + (\xi - \xi_j)\,\theta^{M_j},$$

which proves (34) and thus completes the proof. □
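The invariance of the gradient on the active set, which is the core of this argument, can be verified numerically. The sketch below uses an arbitrary index set `M` and an arbitrary coefficient vector `b` in place of an actual Lasso solution, since the identity holds for any such choice. The function names and simulated data are illustrative assumptions.

```python
import numpy as np

def restricted_ols(X, eps, M):
    """theta^M: OLS regression of eps on the columns in M, zero elsewhere, cf. (32)."""
    theta = np.zeros(X.shape[1])
    coef, *_ = np.linalg.lstsq(X[:, M], eps, rcond=None)
    theta[M] = coef
    return theta

rng = np.random.default_rng(1)
n, p = 50, 120
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)
beta = np.zeros(p)
beta[:2] = 1.0
M = np.array([0, 1, 17, 42])                       # hypothetical active set
theta = restricted_ols(X, eps, M)

def y_denoised(xi):
    """De-noised response Y(xi) = X beta + xi * eps, cf. (29)."""
    return X @ beta + xi * eps

b = 0.1 * rng.standard_normal(p)                   # arbitrary coefficients, for illustration only
xi, xi_new = 0.3, 0.7
g_old = X[:, M].T @ (y_denoised(xi) - X @ b)
g_new = X[:, M].T @ (y_denoised(xi_new) - X @ (b + (xi_new - xi) * theta))
orth = X[:, M].T @ (eps - X @ theta)               # OLS residual is orthogonal to columns in M
print(np.max(np.abs(g_old - g_new)), np.max(np.abs(orth)))
```

Both printed quantities are zero up to floating point error: moving along the restricted least squares direction leaves the gradient of the variables in `M` unchanged.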


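Lemma 6 Under the assumptions of Theorem 1, the variance term is bounded by

$$\|\hat\beta^{\lambda_n} - \beta^{\lambda_n}\|^2_{\ell_2} = O_p\Big(\frac{\log p_n}{n}\,\frac{m_{\lambda_n}}{\phi^2_{\min}(m_{\lambda_n})}\Big),$$

where $m_{\lambda_n} = n^2 \lambda_n^{-2} \sigma_y^2\,\phi_{\max}(\min\{n,p\})$.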

Proof. By Lemmas 5 and 4, the variance can be bounded by

$$\|\hat\beta^{\lambda_n} - \beta^{\lambda_n}\|^2_{\ell_2} = O_p\Big(\frac{\log p_n}{n}\,\frac{m}{\phi^2_{\min}(m)}\Big),$$

where $m = \sup_{0 \le \xi \le 1} |\mathcal{A}_{\lambda,\xi}|$ is the maximal number of active variables of the de-noised estimate (30). Using Lemma 3, the number $m$ of active variables is bounded, with probability converging to 1 for $n \to \infty$, by $m_{\lambda_n} = n^2 \lambda_n^{-2} \sigma_y^2\,\phi_{\max}(\min\{n,p\})$, which completes the proof. □


4  Proof  of  Theorem  2 

The proof of Theorem 2 is in most parts analogous to the proof of Lemma 1 as the bound on the variance part remains unchanged. Only a bound on the $\ell_2$-norm of the bias has to be recalculated. First, we derive a bound on the $\ell_1$-norm of $\gamma^\lambda$, similar to (17). The solution $\beta^\lambda$ can be written as $\beta + \gamma^\lambda$, where

$$\gamma^\lambda = \operatorname{argmin}_{\zeta \in \mathbb{R}^p} f(\zeta), \qquad (38)$$

and the function $f(\zeta)$ defined in (16) is now written as

$$f(\zeta) = n\,\zeta^T C \zeta + \lambda \sum_{k \le p} (|\beta_k + \zeta_k| - |\beta_k|).$$

In the proof of Lemma 1, $K$ was the set of variables with truly non-zero coefficients. In the current setting, all coefficients are potentially different from zero. Let instead $K$ be in the following the set of variables whose coefficient is among the $r_n$ largest, where $r_n$ is some sequence (to be chosen later) with $r_n \to \infty$ for $n \to \infty$. Assume without loss of generality that $|\beta_1| \ge |\beta_2| \ge \dots \ge |\beta_p|$. Then,

$$K = \{k : k \le r_n\}.$$

We can use again that $f(\gamma^\lambda) \le 0$ as setting $\zeta = 0$ yields already the upper bound $f(0) = 0$. Using that $C$ is positive semi-definite,

$$\sum_{k \in K^c} (|\beta_k + \gamma_k| - |\beta_k|) \le -\sum_{k \in K} (|\beta_k + \gamma_k| - |\beta_k|).$$

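Note that on the one hand $|c + d| - |d| \le |c|$ for all $c, d \in \mathbb{R}$. On the other hand, $|c + d| - |d| \ge |c| - 2|d|$ for all $c, d \in \mathbb{R}$. Thus

$$\|\gamma(K^c)\|_{\ell_1} - 2\|\beta(K^c)\|_{\ell_1} = \sum_{k \in K^c} (|\gamma_k| - 2|\beta_k|) \le \sum_{k \in K} |\gamma_k| = \|\gamma(K)\|_{\ell_1}.$$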

As $\beta$ lies in a weak $\ell_q$-ball with $0 < q < 1$, we have, by summing up the smallest entries of $\beta$, that there exists some constant $c > 0$ so that

$$\|\beta(K^c)\|_{\ell_1} \le c\, s_{q,n}\, r_n^{(q-1)/q},$$

and hence similarly as before

$$\|\gamma\|_{\ell_1} = \|\gamma(K^c)\|_{\ell_1} + \|\gamma(K)\|_{\ell_1} \le 2\|\gamma(K)\|_{\ell_1} + 2c\, s_{q,n}\, r_n^{(q-1)/q} \le 2\sqrt{r_n}\,\|\gamma\|_{\ell_2} + 2c\, s_{q,n}\, r_n^{(q-1)/q}, \qquad (39)$$

where the last inequality follows due to $\gamma(K)$ having at most $r_n$ non-zero entries, essentially by definition of $K$. Note that the last inequality holds for any sequence $r_n$. We are going to choose $r_n$ further below in a way that minimizes the bound on the bias.

Just as in the proof of Lemma 1, let $|\gamma_{(1)}| \ge |\gamma_{(2)}| \ge \dots \ge |\gamma_{(p)}|$ be the ordered entries of $\gamma$, let $u_n$ be some sequence of positive integers and define $U = \{k : |\gamma_k| \ge |\gamma_{(u_n)}|\}$. Analogously to the proof of Lemma 1, we obtain the bound (24) in slightly different notation as

$$\gamma^T C \gamma \ge \phi_{\min}(u_n)\,\|\gamma\|^2_{\ell_2} - 2\sqrt{\phi_{\max}(n)\,\phi_{\min}(u_n)}\,\|\gamma\|_{\ell_2}\,\|\gamma(U^c)\|_{\ell_2}. \qquad (40)$$

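By the same argument as in the proof of Lemma 1, it also holds that $\gamma^T C \gamma \le \lambda\|\gamma\|_{\ell_1}/n$. Just as in (25), we have additionally

$$\|\gamma(U^c)\|_{\ell_2} \le \|\gamma\|_{\ell_1}/\sqrt{u_n},$$

and hence

$$\lambda\|\gamma\|_{\ell_1}/n \ge \phi_{\min}(u_n)\,\|\gamma\|^2_{\ell_2} - 2\sqrt{\phi_{\max}(n)\,\phi_{\min}(u_n)}\,\|\gamma\|_{\ell_2}\,\|\gamma\|_{\ell_1}/\sqrt{u_n},$$

which is equivalent to

$$\phi_{\min}(u_n)\,\|\gamma\|^2_{\ell_2} \le \|\gamma\|_{\ell_1} \Big\{\lambda/n + 2\sqrt{u_n^{-1}\,\phi_{\min}(u_n)\,\phi_{\max}(n)}\,\|\gamma\|_{\ell_2}\Big\}. \qquad (41)$$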

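We still have complete freedom in choosing the sequences $\{u_n\}_{n=1,\dots,\infty}$ and $\{r_n\}_{n=1,\dots,\infty}$. We now choose $u_n = m_{\lambda_n}$, where $m_{\lambda_n}$ is given in (8) with $m_{\lambda_n} \asymp n^2 \lambda_n^{-2}$. By assumption, there exists some $\kappa > 0$ such that $\liminf_{n\to\infty} \phi_{\min}(m_{\lambda_n}) \ge \kappa$ and the last equation (41) implies thus that there exists a constant $c(\kappa) > 0$ so that for sufficiently large values of $n$,

$$\|\gamma\|^2_{\ell_2} \le c(\kappa)\,\frac{\lambda}{n}\,\|\gamma\|_{\ell_1}\,(1 + \|\gamma\|_{\ell_2}).$$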

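As long as we focus on cases for which $\|\gamma\|_{\ell_2}$ stays bounded (it will turn out below that this holds true), it follows that

$$\|\gamma\|^2_{\ell_2} = O\Big(\frac{\lambda}{n}\,\|\gamma\|_{\ell_1}\Big).$$

Combining with the bound (39) on the $\ell_1$-norm of $\gamma$ then yields

$$\|\gamma\|^2_{\ell_2} = O\Big(\frac{\lambda}{n}\Big(\sqrt{r_n}\,\|\gamma\|_{\ell_2} + s_{q,n}\, r_n^{(q-1)/q}\Big)\Big).$$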

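We can still choose $r_n$ freely. We choose $r_n$ to make both terms on the right hand side of the last equation of the same order. Using in particular $r_n \asymp (n s_{q,n}/\lambda_n)^q$, it follows with $m_{\lambda_n} \asymp n^2/\lambda_n^2$ and the definition (12) of the effective sparsity as $s^{\mathrm{eff}}_{q,n} = s_{q,n}^{2q/(2-q)}$ that

$$\|\gamma\|^2_{\ell_2} = O\Big(s_{q,n}^{q}\, m_{\lambda_n}^{-(1-q/2)}\Big) = O\Big(\Big(\frac{s^{\mathrm{eff}}_{q,n}}{m_{\lambda_n}}\Big)^{1-q/2}\Big),$$

which completes the proof. □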
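The balancing of the two terms in the final step of this proof can be written out explicitly. The following is a sketch of the arithmetic only, with all constants absorbed into the order notation. The shorthand $m = m_{\lambda_n}$, $s = s_{q,n}$ and $r = r_n$ is introduced here for brevity and is not part of the original text.

```latex
% Assumptions used: \lambda/n \asymp m^{-1/2} and r \asymp (n s/\lambda)^q \asymp (s\sqrt{m})^q.
\begin{align*}
\text{first term dominant:}\quad
  \|\gamma\|_{\ell_2}^2 \lesssim m^{-1/2}\sqrt{r}\,\|\gamma\|_{\ell_2}
  &\;\Longrightarrow\; \|\gamma\|_{\ell_2}^2 \lesssim \frac{r}{m} \asymp s^{q}\, m^{q/2-1},\\
\text{second term:}\quad
  m^{-1/2}\, s\, r^{(q-1)/q}
  &\asymp m^{-1/2}\, s\, (s\sqrt{m})^{q-1} = s^{q}\, m^{q/2-1},\\
\text{effective sparsity:}\quad
  s^{q}\, m^{q/2-1}
  &= \Big(\frac{s^{2q/(2-q)}}{m}\Big)^{1-q/2}
   = \Big(\frac{s^{\mathrm{eff}}_{q,n}}{m}\Big)^{1-q/2}.
\end{align*}
```

Both terms are thus of the same order for this choice of $r_n$. For $q \to 0$ the exponent $1 - q/2$ tends to 1, which matches the form $s_n/m_{\lambda_n}$ obtained by squaring the bound of Lemma 1.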


5  Numerical  Illustration:  Frequency  Detection 

Instead of extensive numerical simulations, we would like to illustrate a few aspects of Lasso-type variable selection if the irrepresentable condition is not fulfilled. We are not making claims that the Lasso is superior to other methods for high-dimensional data. We merely want to draw attention to the fact that (a) the Lasso might not be able to select the correct variables but (b) comes nevertheless close to the true vector in an $\ell_2$-sense.

Figure 1: The energy $\log \Delta E(\omega)$ for a noise level $\sigma = 0.2$ is shown on the left for a range of frequencies $\omega$. A close-up of the region around the peak is shown on the right. The two frequencies $\omega_1$ and $\omega_2$ are marked with solid vertical lines, while the resonance frequency $(\omega_1 + \omega_2)/2$ is shown with a broken vertical line.

An illustrative example is frequency detection. It is of interest in some areas of the physical sciences to accurately detect and resolve frequency components; two examples are variable stars (Pojmanski, 2002) and detection of gravitational waves (Cornish and Crowder, 2005; Umstatter et al., 2005). A non-parametric approach is often most suitable for fitting of the involved periodic functions (Hall et al., 2000). However, we assume here for simplicity that the observations $Y = (Y_1, \dots, Y_n)$ at time points $t = (t_1, \dots, t_n)$ are of the form

$$Y_i = \sum_{\omega \in \Omega} \beta_\omega \sin(2\pi\omega t_i + \phi_\omega) + \varepsilon_i,$$

where $\Omega$ contains the set of fundamental frequencies involved, and $\varepsilon_i$ for $i = 1, \dots, n$ is independently and identically distributed noise with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. To simplify the problem even more, we assume that the phases are known to be zero, $\phi_\omega = 0$ for all $\omega \in \Omega$. Otherwise one might like to employ the Group Lasso (Yuan and Lin, 2006), grouping together the sine and cosine part of identical frequencies.

It is of interest to resolve closely adjacent spectral lines (Hannan and Quinn, 1989) and we will work in this setting in the following. We choose for the experiment $n = 200$ evenly spaced observation times. There are supposed to be two closely adjacent frequencies with $\omega_1 = 0.0545$ and $\omega_2 = 0.0555 = \omega_1 + 1/300$, both entering with $\beta_{\omega_1} = \beta_{\omega_2} = 1$. As we have the information that the phase is zero for all frequencies, the predictor variables are given by all sine-functions with frequencies evenly spaced between $1/200$ and $1/2$, with a spacing of $1/600$ between adjacent frequencies.

In the chosen setting, the irrepresentable condition is violated for the frequency $\omega_m = (\omega_1 + \omega_2)/2$. Even in the absence of noise, this resonance frequency is included in the Lasso-estimate for all positive penalty parameters, as can be seen from the results further below. As a consequence of a violated irrepresentable condition, the largest peak in the periodogram is in general obtained for the resonance frequency. In Figure 1 we show the periodogram (Scargle, 1982) under a moderate noise level $\sigma = 0.2$. The periodogram shows the amount of energy in each frequency, and is defined through the function

$$\Delta E(\omega) = \sum_i Y_i^2 - \sum_i \big(Y_i - \hat Y^{(\omega)}_i\big)^2,$$

where $\hat Y^{(\omega)}$ is the least squares fit of the observations $Y$, using only sine and cosine functions with frequency $\omega$ as two predictor variables. There is clearly a peak at frequency $\omega_m$. As can be seen in the close-up around $\omega_m$, it is not immediately obvious from the periodogram that there are two frequencies at $\omega_1$ and $\omega_2$. As said above, the irrepresentable condition is violated for the resonance frequency and it is of interest to see which frequencies are picked up by the Lasso estimator.
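A periodogram of this kind can be computed with a few lines of code. The sketch below measures the energy at a frequency as the reduction in the residual sum of squares achieved by a least squares fit on the sine and cosine at that frequency, which is an assumption about the precise definition. The observation times $t_i = i$, the random seed and the function name `periodogram_energy` are likewise illustrative choices; the frequencies, noise level and grid are taken as printed in the text.

```python
import numpy as np

def periodogram_energy(t, y, freqs):
    """Energy per frequency: sum(y^2) minus the RSS of a sine/cosine least squares fit."""
    total = float(y @ y)
    energy = np.empty(len(freqs))
    for j, w in enumerate(freqs):
        Z = np.column_stack([np.sin(2 * np.pi * w * t), np.cos(2 * np.pi * w * t)])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        energy[j] = total - float(resid @ resid)
    return energy

rng = np.random.default_rng(0)
n = 200
t = np.arange(1, n + 1, dtype=float)               # assumed: unit-spaced observation times
w1, w2, sigma = 0.0545, 0.0555, 0.2
y = np.sin(2 * np.pi * w1 * t) + np.sin(2 * np.pi * w2 * t) + sigma * rng.standard_normal(n)

freqs = np.arange(1 / 200, 1 / 2, 1 / 600)         # frequency grid with spacing 1/600
energy = periodogram_energy(t, y, freqs)
peak = float(freqs[np.argmax(energy)])
print(peak)
```

With these settings the largest energy is found at or next to the midpoint of the two frequencies, so the periodogram alone does not separate them.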

The results are shown in Figures 2 and 3. Figure 3 highlights that the two true frequencies
are picked up by the Lasso with high probability. The resonance frequency is also selected
with high probability, no matter how the penalty is chosen. This result could be expected, as
the irrepresentable condition is violated and the estimator thus cannot be sign consistent.
We expect from the theoretical results in this manuscript that the coefficient of the falsely
selected resonance frequency is very small if the penalty parameter is chosen correctly.
Indeed, Figure 2 shows that the coefficients of the true frequencies are much larger
than the coefficient of the resonance frequency for an appropriate choice of the penalty
parameter.

These results reinforce our conclusion that the Lasso might not be able to pick up the correct
sparsity pattern, but nevertheless delivers useful approximations, as falsely selected variables
enter only with very small coefficients; this behavior is typical and expected from the
results of Theorem 1. Falsely selected variables can thus be removed in a second step,
either by thresholding variables with small coefficients or by using other relaxation techniques.
In any case, it is reassuring to know that all important variables are included in the Lasso
estimate.


6  Concluding  Remarks 

It has recently been discovered that the Lasso cannot recover the correct sparsity pattern in
certain circumstances, not even asymptotically for $p$ fixed and $n \to \infty$. This casts some
doubt on whether the Lasso is a good method for identifying sparse models for both
low- and high-dimensional data.

Figure 2: An example where the Lasso is bound to select wrong variables, while being a good
approximation to the true vector in the $\ell_2$-sense. Top row: The noise level increases from left to
right as $\sigma = 0, 0.1, 0.2, 1$. For one run of the simulation, paths of the estimated coefficients are
shown as a function of the square root $\sqrt{\lambda}$ of the penalty parameter. The actually present signal
frequencies $\omega_1$ and $\omega_2$ are shown as solid lines, the resonance frequency as a broken line, and all
other frequencies are shown as dotted lines. Bottom row: the shaded areas contain, for 90% of all
simulations, the regularization paths of the signal frequencies (region with solid borders), resonance
frequency (area with broken borders) and all other frequencies (area with dotted boundaries). The
path of the resonance frequency displays reverse shrinkage, as its coefficient gets in general smaller
for smaller values of the penalty. As expected from the theoretical results, if the penalty parameter
is chosen correctly, it is possible to separate the signal and resonance frequencies for sufficiently
low noise levels by just retaining large and neglecting small coefficients. It is also apparent that
the coefficient of the resonance frequency is small for a correct choice of the penalty parameter but
very seldom identically zero.

Figure 3: The top row shows the $\ell_2$-distance between $\beta$ and $\hat{\beta}^{\lambda}$ separately for the signal frequencies
(solid blue line), resonance frequency (broken red line) and all other frequencies (dotted gray line).
It is evident that the distance is quite small for all three categories simultaneously if the noise
level is sufficiently low (the noise level is again increasing from left to right as $\sigma = 0, 0.1, 0.2, 1$).
The bottom row shows, on the other hand, the average number of selected variables (with non-zero
estimated regression coefficient) in each of the three categories as a function of the penalty
parameter. It is impossible to choose the correct model, as the resonance frequency is always
selected, no matter how low the noise level and no matter how the penalty parameter is chosen.
This illustrates that sign consistency does not hold if the irrepresentable condition is violated, even
though the estimate can be close to the true vector $\beta$ in the $\ell_2$-sense.

Here we have shown that the Lasso can continue to deliver good approximations to sparse
coefficient vectors $\beta$ in the sense that the $\ell_2$-difference $\|\beta - \hat{\beta}^{\lambda}\|_{\ell_2}$ vanishes for large sample
sizes $n$, even if it fails to discover the correct sparsity pattern. The conditions needed for
a good approximation in the $\ell_2$-sense are weaker than the irrepresentable condition needed for
sign consistency. We pointed out that the correct sparsity pattern could be recovered in
a two-stage procedure. The first step consists of a regular Lasso fit. Variables with small
absolute coefficients are then removed from the model in a second step.
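
The two-stage procedure can be sketched as follows. The first function is a plain coordinate-descent Lasso solver for the objective $\|Y - X\beta\|_{\ell_2}^2 + \lambda \|\beta\|_{\ell_1}$ (the scaling of the penalty is an assumption here), and the second is a hard-thresholding step. The function names, the fixed number of sweeps, and the `cutoff` value are hypothetical choices for illustration, not the procedure used to produce the figures.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, returning 0.0 if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for min ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam / 2) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


def hard_threshold(beta, cutoff):
    """Second stage: set coefficients with |b| <= cutoff to zero."""
    return [b if abs(b) > cutoff else 0.0 for b in beta]


# Toy example with an orthonormal design (hypothetical data).
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.2, 0.0]
beta_lasso = lasso_cd(X, y, lam=0.2)           # first stage
beta_final = hard_threshold(beta_lasso, 0.5)   # second stage
```

In this toy case the first stage keeps a small non-zero coefficient on the second variable, and the second stage removes it while retaining the large one. How to choose the cutoff in practice is left open here.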

We derived possible scenarios under which $\ell_2$-consistency can be achieved as a function of
the sparsity of the vector $\beta$, the number of samples and the number of variables. The only
condition we impose on the design matrix is that the minimal singular values of the design
matrices induced by selecting a small number of variables are bounded away from zero by an
arbitrarily small constant.
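
One way to write this condition in symbols is sketched below. The notation is introduced here for illustration only: $X$ is the $n \times p$ design matrix, $X_S$ the sub-matrix formed by the columns in an index set $S$, $m$ a bound on the number of selected variables, and $c$ a constant.

```latex
\min_{S \subseteq \{1,\dots,p\},\; |S| \le m}
  \lambda_{\min}\!\Bigl( \tfrac{1}{n} X_S^{T} X_S \Bigr) \;\ge\; c \;>\; 0 .
```

Here $\lambda_{\min}$ denotes the smallest eigenvalue, which for $|S| \le n$ is the square of the smallest singular value of $X_S / \sqrt{n}$.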

It was also shown that recovery of sparse vectors $\beta$ is possible if sparseness is measured in
other ways than the number of non-zero entries of $\beta$. We obtain recovery of vectors in weak
$\ell_q$-balls with $0 < q < 1$. In summary, the Lasso selects all sufficiently large coefficients,
and possibly some other unwanted variables. The number of variables can thus be narrowed
down considerably with the Lasso, while keeping all important variables. We hope these results
support the view that the Lasso is a useful model identification method for high-dimensional
data.